{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "machine_shape": "hm",
      "gpuType": "A100"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## 引言"
      ],
      "metadata": {
        "id": "orSAIpHYHPGG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "在日常生活中，大家经常会遇到图像关键信息自动抽取的场景，比如身份证拍照上传自动识别、发票拍照上传自动报销等。\n",
        "\n",
        "\n",
        "在这个领域，现有的 AI 技术方案已经能解决一部分需求，但是依然存在一些痛点，比如发票的种类样式极其繁多，基于 OCR 文字识别+规则后处理的方案无法有效覆盖全部样式，即泛化性很差。如果要强行覆盖全部样式，成本又太高。\n",
        "\n",
        "\n",
        "针对这样的问题，开放群岛开源社区大模型SIG联合组长单位之一飞桨团队推出基于文心大模型的全新解决方案——PP-ChatOCR。本项目在飞桨团队的基础上，使用智谱AI的GLM4模型底座，实现使用GLM方案完成自定义信息抽取工作。\n",
        "\n",
        "\n",
        "本项目使用的数据样例来自互联网，侵权删。"
      ],
      "metadata": {
        "id": "1dmyShlmHRT4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 1. 检查显卡、显存以及CUDA版本号"
      ],
      "metadata": {
        "id": "G775saPlYP2B"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!nvidia-smi"
      ],
      "metadata": {
        "id": "2tsk2mpvFpuS",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "ba9eb038-3acd-4e23-e1ff-10911a2273c9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Sat Apr  6 10:35:43 2024       \n",
            "+---------------------------------------------------------------------------------------+\n",
            "| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |\n",
            "|-----------------------------------------+----------------------+----------------------+\n",
            "| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
            "| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n",
            "|                                         |                      |               MIG M. |\n",
            "|=========================================+======================+======================|\n",
            "|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |\n",
            "| N/A   31C    P0              45W / 400W |      2MiB / 40960MiB |      0%      Default |\n",
            "|                                         |                      |             Disabled |\n",
            "+-----------------------------------------+----------------------+----------------------+\n",
            "                                                                                         \n",
            "+---------------------------------------------------------------------------------------+\n",
            "| Processes:                                                                            |\n",
            "|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |\n",
            "|        ID   ID                                                             Usage      |\n",
            "|=======================================================================================|\n",
            "|  No running processes found                                                           |\n",
            "+---------------------------------------------------------------------------------------+\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 2. 安装CUDA版本对应的飞桨框架PaddlePaddle-GPU以及PaddleOCR框架，推荐用飞桨源"
      ],
      "metadata": {
        "id": "B4GR6KF7bD4M"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 安装飞桨框架并验证\n",
        "!python -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html\n",
        "import paddle\n",
        "paddle.utils.run_check()"
      ],
      "metadata": {
        "id": "zu_F33q2bLdo",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "846aef48-4a72-4cbd-becb-d6500d55c276"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in links: https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html\n",
            "Requirement already satisfied: paddlepaddle-gpu==2.6.1.post120 in /usr/local/lib/python3.10/dist-packages (2.6.1.post120)\n",
            "Requirement already satisfied: httpx in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (0.27.0)\n",
            "Requirement already satisfied: numpy>=1.13 in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (1.25.2)\n",
            "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (10.0.1)\n",
            "Requirement already satisfied: decorator in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (4.4.2)\n",
            "Requirement already satisfied: astor in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (0.8.1)\n",
            "Requirement already satisfied: opt-einsum==3.3.0 in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (3.3.0)\n",
            "Requirement already satisfied: protobuf>=3.20.2 in /usr/local/lib/python3.10/dist-packages (from paddlepaddle-gpu==2.6.1.post120) (3.20.3)\n",
            "Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx->paddlepaddle-gpu==2.6.1.post120) (3.7.1)\n",
            "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx->paddlepaddle-gpu==2.6.1.post120) (2024.2.2)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx->paddlepaddle-gpu==2.6.1.post120) (1.0.5)\n",
            "Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx->paddlepaddle-gpu==2.6.1.post120) (3.6)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx->paddlepaddle-gpu==2.6.1.post120) (1.3.1)\n",
            "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx->paddlepaddle-gpu==2.6.1.post120) (0.14.0)\n",
            "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx->paddlepaddle-gpu==2.6.1.post120) (1.2.0)\n",
            "Running verify PaddlePaddle program ... \n",
            "PaddlePaddle works well on 1 GPU.\n",
            "PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# 安装PaddleOCR并验证\n",
        "!pip install paddleocr\n",
        "import paddleocr\n",
        "print(paddleocr.__version__)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ugmph_xBctOL",
        "outputId": "e1672663-b066-40a1-a55b-082772be6824"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: paddleocr in /usr/local/lib/python3.10/dist-packages (2.7.3)\n",
            "Requirement already satisfied: shapely in /usr/local/lib/python3.10/dist-packages (from paddleocr) (2.0.3)\n",
            "Requirement already satisfied: scikit-image in /usr/local/lib/python3.10/dist-packages (from paddleocr) (0.19.3)\n",
            "Requirement already satisfied: imgaug in /usr/local/lib/python3.10/dist-packages (from paddleocr) (0.4.0)\n",
            "Requirement already satisfied: pyclipper in /usr/local/lib/python3.10/dist-packages (from paddleocr) (1.3.0.post5)\n",
            "Requirement already satisfied: lmdb in /usr/local/lib/python3.10/dist-packages (from paddleocr) (1.4.1)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from paddleocr) (4.66.2)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from paddleocr) (1.25.2)\n",
            "Requirement already satisfied: visualdl in /usr/local/lib/python3.10/dist-packages (from paddleocr) (2.5.3)\n",
            "Requirement already satisfied: rapidfuzz in /usr/local/lib/python3.10/dist-packages (from paddleocr) (3.7.0)\n",
            "Requirement already satisfied: opencv-python<=4.6.0.66 in /usr/local/lib/python3.10/dist-packages (from paddleocr) (4.6.0.66)\n",
            "Requirement already satisfied: opencv-contrib-python<=4.6.0.66 in /usr/local/lib/python3.10/dist-packages (from paddleocr) (4.6.0.66)\n",
            "Requirement already satisfied: cython in /usr/local/lib/python3.10/dist-packages (from paddleocr) (3.0.10)\n",
            "Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from paddleocr) (4.9.4)\n",
            "Requirement already satisfied: premailer in /usr/local/lib/python3.10/dist-packages (from paddleocr) (3.10.0)\n",
            "Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (from paddleocr) (3.1.2)\n",
            "Requirement already satisfied: attrdict in /usr/local/lib/python3.10/dist-packages (from paddleocr) (2.0.1)\n",
            "Requirement already satisfied: Pillow>=10.0.0 in /usr/local/lib/python3.10/dist-packages (from paddleocr) (10.0.1)\n",
            "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from paddleocr) (6.0.1)\n",
            "Requirement already satisfied: python-docx in /usr/local/lib/python3.10/dist-packages (from paddleocr) (1.1.0)\n",
            "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from paddleocr) (4.12.3)\n",
            "Requirement already satisfied: fonttools>=4.24.0 in /usr/local/lib/python3.10/dist-packages (from paddleocr) (4.50.0)\n",
            "Requirement already satisfied: fire>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from paddleocr) (0.6.0)\n",
            "Requirement already satisfied: pdf2docx in /usr/local/lib/python3.10/dist-packages (from paddleocr) (0.5.8)\n",
            "Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from fire>=0.3.0->paddleocr) (1.16.0)\n",
            "Requirement already satisfied: termcolor in /usr/local/lib/python3.10/dist-packages (from fire>=0.3.0->paddleocr) (2.4.0)\n",
            "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->paddleocr) (2.5)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from imgaug->paddleocr) (1.11.4)\n",
            "Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from imgaug->paddleocr) (3.7.1)\n",
            "Requirement already satisfied: imageio in /usr/local/lib/python3.10/dist-packages (from imgaug->paddleocr) (2.31.6)\n",
            "Requirement already satisfied: networkx>=2.2 in /usr/local/lib/python3.10/dist-packages (from scikit-image->paddleocr) (3.2.1)\n",
            "Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.10/dist-packages (from scikit-image->paddleocr) (2024.2.12)\n",
            "Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image->paddleocr) (1.6.0)\n",
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image->paddleocr) (24.0)\n",
            "Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl->paddleocr) (1.1.0)\n",
            "Requirement already satisfied: PyMuPDF>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from pdf2docx->paddleocr) (1.24.1)\n",
            "Requirement already satisfied: opencv-python-headless>=4.5 in /usr/local/lib/python3.10/dist-packages (from pdf2docx->paddleocr) (4.9.0.80)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from python-docx->paddleocr) (4.10.0)\n",
            "Requirement already satisfied: cssselect in /usr/local/lib/python3.10/dist-packages (from premailer->paddleocr) (1.2.0)\n",
            "Requirement already satisfied: cssutils in /usr/local/lib/python3.10/dist-packages (from premailer->paddleocr) (2.10.2)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from premailer->paddleocr) (2.31.0)\n",
            "Requirement already satisfied: cachetools in /usr/local/lib/python3.10/dist-packages (from premailer->paddleocr) (5.3.3)\n",
            "Requirement already satisfied: bce-python-sdk in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (0.9.6)\n",
            "Requirement already satisfied: flask>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (2.2.5)\n",
            "Requirement already satisfied: Flask-Babel>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (4.0.0)\n",
            "Requirement already satisfied: protobuf>=3.20.0 in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (3.20.3)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (2.0.3)\n",
            "Requirement already satisfied: rarfile in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (4.2)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from visualdl->paddleocr) (5.9.5)\n",
            "Requirement already satisfied: Werkzeug>=2.2.2 in /usr/local/lib/python3.10/dist-packages (from flask>=1.1.1->visualdl->paddleocr) (3.0.2)\n",
            "Requirement already satisfied: Jinja2>=3.0 in /usr/local/lib/python3.10/dist-packages (from flask>=1.1.1->visualdl->paddleocr) (3.1.3)\n",
            "Requirement already satisfied: itsdangerous>=2.0 in /usr/local/lib/python3.10/dist-packages (from flask>=1.1.1->visualdl->paddleocr) (2.1.2)\n",
            "Requirement already satisfied: click>=8.0 in /usr/local/lib/python3.10/dist-packages (from flask>=1.1.1->visualdl->paddleocr) (8.1.7)\n",
            "Requirement already satisfied: Babel>=2.12 in /usr/local/lib/python3.10/dist-packages (from Flask-Babel>=3.0.0->visualdl->paddleocr) (2.14.0)\n",
            "Requirement already satisfied: pytz>=2022.7 in /usr/local/lib/python3.10/dist-packages (from Flask-Babel>=3.0.0->visualdl->paddleocr) (2023.4)\n",
            "Requirement already satisfied: PyMuPDFb==1.24.1 in /usr/local/lib/python3.10/dist-packages (from PyMuPDF>=1.19.0->pdf2docx->paddleocr) (1.24.1)\n",
            "Requirement already satisfied: pycryptodome>=3.8.0 in /usr/local/lib/python3.10/dist-packages (from bce-python-sdk->visualdl->paddleocr) (3.20.0)\n",
            "Requirement already satisfied: future>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from bce-python-sdk->visualdl->paddleocr) (0.18.3)\n",
            "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->imgaug->paddleocr) (1.2.0)\n",
            "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->imgaug->paddleocr) (0.12.1)\n",
            "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->imgaug->paddleocr) (1.4.5)\n",
            "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->imgaug->paddleocr) (3.1.2)\n",
            "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->imgaug->paddleocr) (2.8.2)\n",
            "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->visualdl->paddleocr) (2024.1)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->premailer->paddleocr) (3.3.2)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->premailer->paddleocr) (3.6)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->premailer->paddleocr) (2.0.7)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->premailer->paddleocr) (2024.2.2)\n",
            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from Jinja2>=3.0->flask>=1.1.1->visualdl->paddleocr) (2.1.5)\n",
            "2.7.3\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 3. 使用PaddleOCR端到端的解决方案完成文字识别\n",
        "备注：样例图片来自互联网，侵权删"
      ],
      "metadata": {
        "id": "OjW47Vyv_qdn"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from paddleocr import PaddleOCR, draw_ocr\n",
        "\n",
        "# Paddleocr目前支持的多语言语种可以通过修改lang参数进行切换\n",
        "# 例如`ch`, `en`, `fr`, `german`, `korean`, `japan`\n",
        "ocr = PaddleOCR(use_angle_cls=True, lang=\"ch\")  # need to run only once to download and load model into memory\n",
        "img_path = 'data/sample.jpg'\n",
        "result = ocr.ocr(img_path, cls=True)\n",
        "for idx in range(len(result)):\n",
        "    res = result[idx]\n",
        "    for line in res:\n",
        "        print(line)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hTybmMNHAPDH",
        "outputId": "604d4c65-b3d8-4d90-bc5e-27d5409284fd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[2024/04/06 10:45:07] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=True, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, gpu_id=0, image_dir=None, page_num=0, det_algorithm='DB', det_model_dir='/root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/root/.paddleocr/whl/rec/ch/ch_PP-OCRv4_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_length=25, rec_char_dict_path='/usr/local/lib/python3.10/dist-packages/paddleocr/ppocr/utils/ppocr_keys_v1.txt', use_space_char=True, vis_font_path='./doc/fonts/simfang.ttf', drop_score=0.5, e2e_algorithm='PGNet', e2e_model_dir=None, e2e_limit_side_len=768, e2e_limit_type='max', e2e_pgnet_score_thresh=0.5, e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_pgnet_valid_set='totaltext', e2e_pgnet_mode='fast', use_angle_cls=True, cls_model_dir='/root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_image_shape='3, 48, 192', label_list=['0', '180'], cls_batch_num=6, cls_thresh=0.9, enable_mkldnn=False, cpu_threads=10, use_pdserving=False, warmup=False, sr_model_dir=None, sr_image_shape='3, 32, 128', sr_batch_num=1, draw_img_save_dir='./inference_results', save_crop_res=False, crop_res_save_dir='./output', use_mp=False, total_process_num=1, process_id=0, benchmark=False, save_log_path='./log_output/', show_log=True, use_onnx=False, output='./output', table_max_len=488, table_algorithm='TableAttn', table_model_dir=None, merge_no_span_structure=True, table_char_dict_path=None, layout_model_dir=None, layout_dict_path=None, layout_score_threshold=0.5, layout_nms_threshold=0.5, kie_algorithm='LayoutXLM', ser_model_dir=None, re_model_dir=None, use_visual_backbone=True, ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ocr_order_method=None, mode='structure', image_orientation=False, layout=True, table=True, ocr=True, recovery=False, use_pdf2docx_api=False, invert=False, binarize=False, alphacolor=(255, 255, 255), lang='ch', det=True, rec=True, type='ocr', ocr_version='PP-OCRv4', structure_version='PP-StructureV2')\n",
            "[2024/04/06 10:45:10] ppocr DEBUG: dt_boxes num : 23, elapsed : 0.06136965751647949\n",
            "[2024/04/06 10:45:10] ppocr DEBUG: cls num  : 23, elapsed : 0.03436017036437988\n",
            "[2024/04/06 10:45:10] ppocr DEBUG: rec_res num  : 23, elapsed : 0.13877606391906738\n",
            "[[[162.0, 218.0], [303.0, 218.0], [303.0, 235.0], [162.0, 235.0]], ('证书号第4119418号', 0.9991413950920105)]\n",
            "[[[274.0, 312.0], [572.0, 312.0], [572.0, 346.0], [274.0, 346.0]], ('发明专利证书', 0.999152660369873)]\n",
            "[[[171.0, 395.0], [517.0, 395.0], [517.0, 412.0], [171.0, 412.0]], ('发明名称：一种基于齿轮传动的高精度模具加工装置', 0.9983615279197693)]\n",
            "[[[547.0, 397.0], [646.0, 472.0], [614.0, 515.0], [516.0, 440.0]], ('NIPA', 0.9657814502716064)]\n",
            "[[[171.0, 436.0], [309.0, 436.0], [309.0, 453.0], [171.0, 453.0]], ('发明人：蒋伟达', 0.9948805570602417)]\n",
            "[[[237.0, 479.0], [402.0, 479.0], [402.0, 495.0], [237.0, 495.0]], ('号：ZL201910522666.6', 0.9884902238845825)]\n",
            "[[[172.0, 522.0], [381.0, 522.0], [381.0, 539.0], [172.0, 539.0]], ('专利申请日：2019年06月17日', 0.996185302734375)]\n",
            "[[[171.0, 563.0], [462.0, 564.0], [462.0, 581.0], [171.0, 580.0]], ('专利权人：浙江顺天传动科技股份有限公司', 0.998400866985321)]\n",
            "[[[171.0, 606.0], [188.0, 606.0], [188.0, 624.0], [171.0, 624.0]], ('地', 0.9997441172599792)]\n",
            "[[[237.0, 607.0], [563.0, 607.0], [563.0, 624.0], [237.0, 624.0]], ('址：325100浙江省温州市永嘉县乌牛镇东蒙工业园区', 0.9832388758659363)]\n",
            "[[[171.0, 650.0], [381.0, 650.0], [381.0, 667.0], [171.0, 667.0]], ('授权公告日：2020年11月27日', 0.9983038902282715)]\n",
            "[[[437.0, 650.0], [633.0, 650.0], [633.0, 667.0], [437.0, 667.0]], ('授权公告号：CN110125766B', 0.9963483810424805)]\n",
            "[[[199.0, 682.0], [706.0, 682.0], [706.0, 698.0], [199.0, 698.0]], ('国家知识产权局依照中华人民共和国专利法进行审查，决定授子专利权，颁发发明专利', 0.9750413298606873)]\n",
            "[[[171.0, 703.0], [706.0, 703.0], [706.0, 721.0], [171.0, 721.0]], ('证书并在专利登记簿上予以登记、专利权自疫权公告之日起生效。专利权期限为二十年，自', 0.967517077922821)]\n",
            "[[[172.0, 723.0], [243.0, 723.0], [243.0, 741.0], [172.0, 741.0]], ('申请日起算。', 0.9332730174064636)]\n",
            "[[[199.0, 743.0], [706.0, 743.0], [706.0, 761.0], [199.0, 761.0]], ('专利证书记载专利权登记时的法律状况.专利权的转移、质押、无效、终止、恢复和专', 0.9761695861816406)]\n",
            "[[[172.0, 764.0], [563.0, 764.0], [563.0, 780.0], [172.0, 780.0]], ('利权人的姓名或名称、国籍、地址变更等事项记载在专利登记薄上。', 0.95839524269104)]\n",
            "[[[288.0, 896.0], [443.0, 896.0], [443.0, 965.0], [288.0, 965.0]], ('申雨', 0.9160785675048828)]\n",
            "[[[172.0, 912.0], [210.0, 912.0], [210.0, 932.0], [172.0, 932.0]], ('局长', 0.9996654987335205)]\n",
            "[[[172.0, 939.0], [228.0, 939.0], [228.0, 961.0], [172.0, 961.0]], ('申长雨', 0.9964106678962708)]\n",
            "[[[519.0, 967.0], [641.0, 967.0], [641.0, 985.0], [519.0, 985.0]], ('2020年11月27日', 0.9990251064300537)]\n",
            "[[[370.0, 1017.0], [461.0, 1017.0], [461.0, 1031.0], [370.0, 1031.0]], ('第1页（共2页）', 0.9308853149414062)]\n",
            "[[[359.0, 1057.0], [467.0, 1057.0], [467.0, 1074.0], [359.0, 1074.0]], ('其他事项参见续页', 0.9938048124313354)]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4. 查看识别结果"
      ],
      "metadata": {
        "id": "8yqET8FUAlGg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(txts)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zHNzGsNuA3ux",
        "outputId": "65429a06-fb02-4171-99ee-81d0b4e24b51"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "['证书号第4119418号', '发明专利证书', '发明名称：一种基于齿轮传动的高精度模具加工装置', 'NIPA', '发明人：蒋伟达', '号：ZL201910522666.6', '专利申请日：2019年06月17日', '专利权人：浙江顺天传动科技股份有限公司', '地', '址：325100浙江省温州市永嘉县乌牛镇东蒙工业园区', '授权公告日：2020年11月27日', '授权公告号：CN110125766B', '国家知识产权局依照中华人民共和国专利法进行审查，决定授子专利权，颁发发明专利', '证书并在专利登记簿上予以登记、专利权自疫权公告之日起生效。专利权期限为二十年，自', '申请日起算。', '专利证书记载专利权登记时的法律状况.专利权的转移、质押、无效、终止、恢复和专', '利权人的姓名或名称、国籍、地址变更等事项记载在专利登记薄上。', '申雨', '局长', '申长雨', '2020年11月27日', '第1页（共2页）', '其他事项参见续页']\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 5. 构造Prompt (Prompt出处来自飞桨团队，针对GLM4进行了一些改动)"
      ],
      "metadata": {
        "id": "8A2r-toUA9jy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "ocr_result=txts\n",
        "key='{\\'证书编号\\',\\'发明名称\\',\\'发明人\\',\\'专利号\\',\\'专利申请日\\',\\'专利申请人\\',\\'地址\\'，\\'授权公告日\\',\\'授权公告号\\'}'"
      ],
      "metadata": {
        "id": "gdnf6ji2A4u9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 6. 查看构造好的Prompt"
      ],
      "metadata": {
        "id": "zL12gY6YBgS9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "prompt = f\"\"\"你现在的任务是从OCR文字识别的结果中提取我指定的关键信息。OCR的文字识别结果使用```符号包围，包含所识别出来的文字，顺序在原始图片中从左至右、从上至下。我指定的关键信息使用[]符号包围。请注意OCR的文字识别结果可能存在长句子换行被切断、不合理的分词、对应错位等问题，你需要结合上下文语义进行综合判断，以抽取准确的关键信息。在返回结果时使用json格式，包含一个key-value对，key值为我指定的关键信息，value值为所抽取的结果。如果认为OCR识别结果中没有关键信息key，则将value赋值为“未找到相关信息”。请只输出json格式的结果，不要包含其它多余文字！下面正式开始：\\n\\nOCR文字：```{ocr_result}```\\n\\n要抽取的关键信息：```{key}```。\\n\\n请只输出json，不要输出额外内容！\\n\\n你的json输出是：\"\"\"\n",
        "print(prompt)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GA2wPMOQBetx",
        "outputId": "a1354d6e-8f85-454f-fd82-5f162e783be8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "你现在的任务是从OCR文字识别的结果中提取我指定的关键信息。OCR的文字识别结果使用```符号包围，包含所识别出来的文字，顺序在原始图片中从左至右、从上至下。我指定的关键信息使用[]符号包围。请注意OCR的文字识别结果可能存在长句子换行被切断、不合理的分词、对应错位等问题，你需要结合上下文语义进行综合判断，以抽取准确的关键信息。在返回结果时使用json格式，包含一个key-value对，key值为我指定的关键信息，value值为所抽取的结果。如果认为OCR识别结果中没有关键信息key，则将value赋值为“未找到相关信息”。请只输出json格式的结果，不要包含其它多余文字！下面正式开始：\n",
            "\n",
            "OCR文字：```['证书号第4119418号', '发明专利证书', '发明名称：一种基于齿轮传动的高精度模具加工装置', 'NIPA', '发明人：蒋伟达', '号：ZL201910522666.6', '专利申请日：2019年06月17日', '专利权人：浙江顺天传动科技股份有限公司', '地', '址：325100浙江省温州市永嘉县乌牛镇东蒙工业园区', '授权公告日：2020年11月27日', '授权公告号：CN110125766B', '国家知识产权局依照中华人民共和国专利法进行审查，决定授子专利权，颁发发明专利', '证书并在专利登记簿上予以登记、专利权自疫权公告之日起生效。专利权期限为二十年，自', '申请日起算。', '专利证书记载专利权登记时的法律状况.专利权的转移、质押、无效、终止、恢复和专', '利权人的姓名或名称、国籍、地址变更等事项记载在专利登记薄上。', '申雨', '局长', '申长雨', '2020年11月27日', '第1页（共2页）', '其他事项参见续页']```\n",
            "\n",
            "要抽取的关键信息：```{'证书编号','发明名称','发明人','专利号','专利申请日','专利申请人','地址'，'授权公告日','授权公告号'}```。\n",
            "\n",
            "请只输出json，不要输出额外内容！\n",
            "\n",
            "你的json输出是：\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 7. 使用GLM-4执行信息抽取任务(0 shot)"
      ],
      "metadata": {
        "id": "y_KMqk87BvN-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install zhipuai\n",
        "from zhipuai import ZhipuAI\n",
        "client = ZhipuAI(api_key=\"\") # 填写您自己的APIKey\n",
        "response = client.chat.completions.create(\n",
        "    model=\"glm-4\",  # 填写需要调用的模型名称\n",
        "    messages=[\n",
        "        {\"role\": \"user\", \"content\": prompt},\n",
        "    ],\n",
        ")\n",
        "print(response.choices[0].message.content)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "T0wLbPjxBpFt",
        "outputId": "2c788936-daf4-4a98-d09f-4b3f53a89dc3"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: zhipuai in /usr/local/lib/python3.10/dist-packages (2.0.1)\n",
            "Requirement already satisfied: httpx>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from zhipuai) (0.27.0)\n",
            "Requirement already satisfied: pydantic>=2.5.2 in /usr/local/lib/python3.10/dist-packages (from zhipuai) (2.6.4)\n",
            "Requirement already satisfied: cachetools>=4.2.2 in /usr/local/lib/python3.10/dist-packages (from zhipuai) (5.3.3)\n",
            "Requirement already satisfied: pyjwt~=2.8.0 in /usr/local/lib/python3.10/dist-packages (from zhipuai) (2.8.0)\n",
            "Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->zhipuai) (3.7.1)\n",
            "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->zhipuai) (2024.2.2)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->zhipuai) (1.0.5)\n",
            "Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->zhipuai) (3.6)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->zhipuai) (1.3.1)\n",
            "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.23.0->zhipuai) (0.14.0)\n",
            "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.5.2->zhipuai) (0.6.0)\n",
            "Requirement already satisfied: pydantic-core==2.16.3 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.5.2->zhipuai) (2.16.3)\n",
            "Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.5.2->zhipuai) (4.10.0)\n",
            "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.23.0->zhipuai) (1.2.0)\n",
            "```json\n",
            "{\n",
            "  \"证书编号\": \"第4119418号\",\n",
            "  \"发明名称\": \"一种基于齿轮传动的高精度模具加工装置\",\n",
            "  \"发明人\": \"蒋伟达\",\n",
            "  \"专利号\": \"ZL201910522666.6\",\n",
            "  \"专利申请日\": \"2019年06月17日\",\n",
            "  \"专利申请人\": \"浙江顺天传动科技股份有限公司\",\n",
            "  \"地址\": \"325100浙江省温州市永嘉县乌牛镇东蒙工业园区\",\n",
            "  \"授权公告日\": \"2020年11月27日\",\n",
            "  \"授权公告号\": \"CN110125766B\"\n",
            "}\n",
            "```\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 8. 对抽取结果进行清洗"
      ],
      "metadata": {
        "id": "dYDvFnHdG0Q9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 查看结果属性\n",
        "type(response.choices[0].message.content)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ycms_3aRCIrX",
        "outputId": "b6924d89-67b2-4b52-8590-643dda11ebf7"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "str"
            ]
          },
          "metadata": {},
          "execution_count": 65
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# 使用正则表达式匹配{}之间的内容\n",
        "import re\n",
        "pattern = re.compile(r'{.*?}', re.DOTALL)  # 匹配最短的大括号内容，包括换行符\n",
        "matches = pattern.findall(a)\n",
        "matches[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        },
        "id": "9qVmoVXQCNEf",
        "outputId": "ab0e29cb-ae04-4c80-b7b0-8eff0133122c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'{\\n  \"证书编号\": \"第4119418号\",\\n  \"发明名称\": \"一种基于齿轮传动的高精度模具加工装置\",\\n  \"发明人\": \"蒋伟达\",\\n  \"专利号\": \"ZL201910522666.6\",\\n  \"专利申请日\": \"2019年06月17日\",\\n  \"专利申请人\": \"浙江顺天传动科技股份有限公司\",\\n  \"地址\": \"325100浙江省温州市永嘉县乌牛镇东蒙工业园区\",\\n  \"授权公告日\": \"2020年11月27日\",\\n  \"授权公告号\": \"CN110125766B\"\\n}'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 61
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import json\n",
        "json_string=matches[0]\n",
        "# 使用json.loads()函数解析JSON字符串\n",
        "parsed_json = json.loads(json_string)\n",
        "# 打印解析后的JSON对象\n",
        "print(parsed_json)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "BYDr-nUrCVTT",
        "outputId": "feebfbd7-30fa-421b-934f-467ebac135d0"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{'证书编号': '第4119418号', '发明名称': '一种基于齿轮传动的高精度模具加工装置', '发明人': '蒋伟达', '专利号': 'ZL201910522666.6', '专利申请日': '2019年06月17日', '专利申请人': '浙江顺天传动科技股份有限公司', '地址': '325100浙江省温州市永嘉县乌牛镇东蒙工业园区', '授权公告日': '2020年11月27日', '授权公告号': 'CN110125766B'}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# 如果您需要以特定的格式访问数据，例如获取“发明名称”\n",
        "invention_name = parsed_json.get('发明名称')\n",
        "print(invention_name)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qW1I6Zr2FU_T",
        "outputId": "28fa2a42-acc0-459b-c147-adbf0f73b94b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "一种基于齿轮传动的高精度模具加工装置\n"
          ]
        }
      ]
    }
  ]
}